A Class of Parallel Decomposition Algorithms for SVMs Training

Andrea Manno, Laura Palagi and Simone Sagratella


Abstract 

The training of Support Vector Machines may be a very difficult task
when dealing with very large datasets. The memory requirement and the
time consumption of SVM algorithms grow rapidly as the amount of data
increases. To overcome these drawbacks, we propose a parallel
decomposition algorithmic scheme for SVMs training for which we prove
global convergence under suitable conditions. We outline how these
assumptions can be satisfied in practice and we suggest various specific
implementations exploiting the adaptable structure of the algorithmic model.

Keywords. Decomposition Algorithm, Big Data, Support Vector Machine, 
Machine Learning, Parallel Computing 


1 Introduction 

A Support Vector Machine (SVM) is a well known classification and regression
tool that has spread in many scientific fields during the last two decades, see [5].
Given a training set of n input-target pairs

$$D = \{(z_r, y_r),\ r = 1,\dots,n,\ z_r \in \mathbb{R}^m,\ y_r \in \{-1,1\}\},$$

an SVM provides a prediction model used to classify new unlabeled samples.
The dual formulation of an SVM training problem is

$$\begin{array}{cl}
\min_{x} & f(x) := \frac{1}{2} x^T Q x - e^T x \\
 & y^T x = 0 \\
 & 0 \le x \le Ce,
\end{array} \tag{1}$$

where $x \in \mathbb{R}^n$, $e \in \mathbb{R}^n$ is a vector of all ones, $C > 0$ is a positive constant,
$y \in \{-1,1\}^n$ and $Q$ is an $n \times n$ symmetric positive semidefinite matrix. Each
component of $x$ is associated with a sample of the training set and $y$ is the
vector of the corresponding labels. Entries of $Q$ are defined by

$$Q_{rq} = y_r y_q K(z_r, z_q), \qquad r, q = 1, \dots, n,$$

where $K : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is a given kernel function [23].
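As an illustration of how the entries of Q are formed, the following minimal Python sketch builds the dense matrix for a tiny dataset. The Gaussian kernel, the parameter `gamma`, the function names and the sample data are illustrative assumptions and are not taken from the paper. For large n the full matrix would not fit in memory, which is the motivation for the decomposition approach discussed next.

```python
import math

def gaussian_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2); gamma is an illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def build_q(z, y, kernel=gaussian_kernel):
    """Return the dense n x n matrix with entries Q[r][q] = y_r * y_q * K(z_r, z_q)."""
    n = len(z)
    return [[y[r] * y[q] * kernel(z[r], z[q]) for q in range(n)] for r in range(n)]

# Hypothetical toy data: three samples in R^2 with labels in {-1, 1}.
z = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
y = [1, -1, 1]
Q = build_q(z, y)
```

The matrix is symmetric by construction, and its diagonal equals one for this kernel since K(z, z) = 1 and y_r^2 = 1.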

Many real SVM applications are characterized by a very large training
set. This implies that the Hessian matrix Q is so big that it cannot be entirely stored
in memory. For this reason classical optimization algorithms that use first and
second order information cannot be used to efficiently solve problem (1).

To overcome this difficulty, many decomposition algorithms have been proposed
in the literature. At each iteration, they split the original problem into a sequence
of smaller subproblems where only a subset of variables (the working set) is
updated. Columns of the Hessian submatrix corresponding to each subproblem are,
partially or entirely, recomputed at each step. These strategies can be mainly
divided into SMO (Sequential Minimal Optimization) and non-SMO methods.
SMO algorithms (see e.g. [3II3I]) work with subproblems of dimension two, so
that their solutions can be computed analytically, while non-SMO algorithms
(see e.g. [ini[i3]) need an iterative optimization method to solve each
subproblem. From the theoretical point of view, the policy for updating the working
set plays a crucial role in proving convergence. In the case of SMO methods, a proper
selection rule based on the maximal violating principle is sufficient to ensure
asymptotic convergence of the decomposition scheme |4llT6]. For larger working
sets, convergence proofs are available under further conditions [15111^120].

In recent years SVMs have been applied to huge datasets, mainly related
to web-oriented applications. To reduce the large amount of time needed for
the training of SVMs on such huge datasets, parallel algorithms have been
proposed. Some of these parallel approaches to SVMs consist in distributing
the most expensive tasks, such as subproblem solving and gradient updating,
among the available processors, see [IllllillZj. Another way to exploit
parallelism fruitfully is based on splitting the training data into subsets and
distributing them among the processors llllHj. Among these parallel techniques,
there is also the so called Cascade-SVM (see [121123]), which was introduced to
tackle very large instances. While achieving a good reduction of the training time
with respect to sequential methods, these methods may lack convergence properties
or may require strong assumptions to prove them.

Actually, combining decomposition rules, for the selection of working sets, 
and parallelism makes the proof of convergence a very difficult task, see m- 
This is mainly due to nonseparability of the feasible set of problem (1).

In this work we propose a class of convergent parallel training algorithms
based on the decomposition of problem (1) into a partition of subproblems that
can be solved independently by parallel processes. The convergence to a global
optimum of problem (1) is proved under realistic assumptions. It partially
exploits results introduced in [3123]. The algorithmic framework presented may
include, as a special case, other convergent theoretical models like [18].

The paper is organized as follows: in Section 2 we introduce some preliminary
results; in Section 3 we introduce a general parallel algorithmic scheme. We
analyze its convergence properties in Sections 4, 5 and 6; in Section 7 we discuss
some possible practical implementations.

Notation. In the following we use this notation. Vectors are boldface. Given
a vector $x \in \mathbb{R}^n$ with components $x_r$ and a subset of indices $P \subseteq \{1,\dots,n\}$
we denote by $x_P \in \mathbb{R}^{|P|}$ the subvector made up of components $x_r$ with $r \in P$
and by $x_{-P} \in \mathbb{R}^{n-|P|}$ the subvector made up of components $x_r$ with $r \notin P$. By
$\|\cdot\|$ we indicate the Euclidean norm, whereas the zero norm $\|x\|_0$ of a vector
denotes the number of nonzero components of $x$. Further, given a square
$n \times n$ matrix $Q$, we denote by $Q_{*r}$ the r-th column of the matrix. Given two
subsets of indices $P_r, P_q \subseteq \{1,\dots,n\}$, we write $Q_{P_rP_q}$ to indicate the $|P_r| \times |P_q|$
submatrix of $Q$ with row indices in block $P_r$ and column indices in block $P_q$. We
denote by $\lambda^Q_{\min}$ and $\lambda^Q_{\max}$ respectively the minimum and maximum eigenvalue
of a square matrix $Q$. For the sake of simplicity we denote the r-th component
of the gradient as $\nabla f(x)_r = Q_{*r}^T x - 1$ and as $\nabla_P f(x) \in \mathbb{R}^{|P|}$ the subvector of the
gradient made up of components with $r \in P$. We denote by $\mathcal{F}$ the feasible
set of problem (1), namely

$$\mathcal{F} = \{x \in \mathbb{R}^n : y^T x = 0,\ 0 \le x \le Ce\}.$$

Note that all the results that we report in the sequel hold also in the case of
feasible set $\mathcal{F} = \{x \in \mathbb{R}^n : y^T x = b,\ 0 \le x \le Ce\}$, where $y \in \mathbb{R}^n$ and $b \in \mathbb{R}$,
but for the sake of simplicity we refer to the case $b = 0$ and $y \in \{-1,1\}^n$.

2 Optimality Conditions and Preliminary Results

Let us consider a solution $x^*$ of problem (1). Since constraints are linear and the
objective function is convex, necessary and sufficient conditions for optimality
are the Karush-Kuhn-Tucker (KKT) conditions that state that there exists a
scalar $s$ such that for all indices $r \in \{1,\dots,n\}$:

$$\begin{array}{ll}
\nabla f(x^*)_r + s y_r \ge 0 & \text{if } x^*_r = 0 \\
\nabla f(x^*)_r + s y_r \le 0 & \text{if } x^*_r = C \\
\nabla f(x^*)_r + s y_r = 0 & \text{if } 0 < x^*_r < C.
\end{array} \tag{2}$$

It is well known (see e.g. m) that KKT conditions can be written in a more
compact form by introducing the following sets

$$I_{up}(x) := \{r \in \{1,\dots,n\} : x_r < C,\ y_r = 1, \text{ or } x_r > 0,\ y_r = -1\},$$

$$I_{low}(x) := \{r \in \{1,\dots,n\} : x_r < C,\ y_r = -1, \text{ or } x_r > 0,\ y_r = 1\}.$$

Assuming that $I_{up}(x^*) \ne \emptyset$ and $I_{low}(x^*) \ne \emptyset$, then we can rewrite (2) as

$$m(x^*) = \max_{r \in I_{up}(x^*)} -\frac{\nabla f(x^*)_r}{y_r} \ \le\ \min_{r \in I_{low}(x^*)} -\frac{\nabla f(x^*)_r}{y_r} = M(x^*). \tag{3}$$

By the convexity of problem (1), we can say that $x^*$ is optimal if and only if
either $I_{up}(x^*) = \emptyset$ or $I_{low}(x^*) = \emptyset$ or condition (3) holds.
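The compact optimality test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dense list-of-lists storage of Q and the two-variable toy problem are hypothetical, and the gradient is recomputed from scratch for clarity.

```python
def gradient(Q, x):
    """Gradient of f(x) = 0.5 x^T Q x - e^T x, i.e. Q x - e."""
    n = len(x)
    return [sum(Q[r][q] * x[q] for q in range(n)) - 1.0 for r in range(n)]

def index_sets(x, y, C):
    """Return the index sets I_up(x) and I_low(x) as Python lists."""
    n = len(x)
    up = [r for r in range(n)
          if (x[r] < C and y[r] == 1) or (x[r] > 0 and y[r] == -1)]
    low = [r for r in range(n)
           if (x[r] < C and y[r] == -1) or (x[r] > 0 and y[r] == 1)]
    return up, low

def kkt_gap(Q, x, y, C):
    """Return (m(x), M(x)), or None when I_up or I_low is empty."""
    g = gradient(Q, x)
    up, low = index_sets(x, y, C)
    if not up or not low:
        return None
    m = max(-g[r] / y[r] for r in up)
    M = min(-g[r] / y[r] for r in low)
    return m, M

# Hypothetical toy problem with two variables and opposite labels.
Q = [[1.0, -0.5], [-0.5, 1.0]]
y = [1, -1]
gap_at_zero = kkt_gap(Q, [0.0, 0.0], y, C=10.0)   # m > M: not optimal
gap_at_opt = kkt_gap(Q, [2.0, 2.0], y, C=10.0)    # m <= M: optimal
```

At x = 0 the test fails because m(x) exceeds M(x), while at x = (2, 2) the gradient vanishes and condition (3) holds.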

Such a form of the KKT conditions is the basis of most efficient sequential
decomposition algorithms for the solution of problem (1). In decomposition
algorithms the sequence $\{x^k\}$ is obtained by changing at each iteration only a
subset of the variables, let's say $x_{P_i}$ with $P_i \subset \{1,\dots,n\}$, whilst the other $x_{-P_i}$
remain unchanged. Thus the sequence takes the form

$$x^{k+1} = x^k + \alpha^k d^k,$$

where $d^k$ is a sparse feasible descent direction such that $\|d^k\|_0 = |P_i|$ with
$|P_i| \ll n$ and $\alpha^k$ represents a stepsize along this direction. Whatever the
feasible direction $d^k$ is, since the objective function is quadratic and convex,
the choice of the stepsize can be performed by using an exact minimization of
the objective function along $d^k$. Indeed, let $\beta > 0$ be the largest feasible step
at $x^k \in \mathcal{F}$ along the descent direction $d^k$, then

$$\alpha^k = \min\left\{ -\frac{\nabla f(x^k)^T d^k}{{d^k}^T Q\, d^k},\ \beta \right\}. \tag{4}$$

Sequential decomposition methods differ in the choice of the direction $d^k$, or
equivalently in the choice of the so called working set $P_i$.

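Sequential Minimal Optimization (SMO) methods use feasible descent
directions $d^k$ with $\|d^k\|_0 = 2$, which is the minimal possible cardinality due to
the equality constraint. At a feasible point $x \in \mathcal{F}$ a feasible direction with two
nonzero components is given by

$$d^{(i,j)}_r = \begin{cases} \dfrac{1}{y_i} & \text{if } r = i \\[4pt] -\dfrac{1}{y_j} & \text{if } r = j \\[4pt] 0 & \text{otherwise} \end{cases} \qquad r = 1,\dots,n, \tag{5}$$

for any pair $(i, j) \in I_{up}(x) \times I_{low}(x)$. We say that a pair $(i, j) \in I_{up}(x) \times I_{low}(x)$
is a violating pair at $x$ if it also satisfies $\nabla f(x)^T d^{(i,j)} < 0$.

The exact optimal stepsize $\alpha > 0$ along a direction $d^{(i,j)}$ can be efficiently
computed by noting that in (4) we have

$$\beta = \min\{\beta_i, \beta_j\}, \tag{6}$$

where

$$\beta_h := \begin{cases} x_h & \text{if } d^{(i,j)}_h < 0 \\ C - x_h & \text{if } d^{(i,j)}_h > 0. \end{cases} \tag{7}$$

Thus we get the value of the optimal stepsize $\alpha$ along a direction $d^{(i,j)}$ as

$$\alpha := \min\left\{ \frac{-\nabla f(x)_i\, y_i + \nabla f(x)_j\, y_j}{Q_{ii} + Q_{jj} - 2 y_i y_j Q_{ij}},\ \beta \right\}. \tag{8}$$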
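The two-variable update described by (5)-(8) can be sketched as follows. The function name, the argument layout and the toy data are hypothetical; the early return for a non-violating pair and the fallback for zero curvature are guards added in this sketch and are not part of formula (8).

```python
def smo_step(Q, g, x, y, C, i, j):
    """Return (alpha, d_i, d_j) for the pair (i, j), following (5)-(8).

    g is the gradient of f at x.
    """
    d_i, d_j = 1.0 / y[i], -1.0 / y[j]            # direction (5)
    slope = g[i] * d_i + g[j] * d_j               # grad f(x)^T d
    if slope >= 0.0:
        return 0.0, d_i, d_j                      # guard: not a violating pair
    beta_i = x[i] if d_i < 0 else C - x[i]        # bounds (7)
    beta_j = x[j] if d_j < 0 else C - x[j]
    beta = min(beta_i, beta_j)                    # largest feasible step (6)
    curvature = Q[i][i] + Q[j][j] - 2.0 * y[i] * y[j] * Q[i][j]
    if curvature <= 0.0:
        return beta, d_i, d_j                     # guard: f is linear along d
    return min(-slope / curvature, beta), d_i, d_j

# Hypothetical toy problem: starting from x = 0 the gradient is -e.
Q = [[1.0, -0.5], [-0.5, 1.0]]
y = [1, -1]
x = [0.0, 0.0]
g = [-1.0, -1.0]
alpha, d_i, d_j = smo_step(Q, g, x, y, C=10.0, i=0, j=1)
x_new = [x[0] + alpha * d_i, x[1] + alpha * d_j]
```

With C = 10 the unconstrained minimizer along the direction is feasible, so the step is not clipped; with a smaller C the bound in (6) becomes active.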


Among such minimal descent directions, i.e. violating pairs, a crucial role is
played by the so called Most Violating Pair (MVP) direction (see e.g. [IS]). To
be more specific, given a feasible point $x$, let us define the sets

$$I^{MVP}_{up}(x) = \left\{ i \in I_{up}(x) : i \in \arg\max_{h \in I_{up}(x)} -\frac{\nabla f(x)_h}{y_h} \right\},$$

$$I^{MVP}_{low}(x) = \left\{ j \in I_{low}(x) : j \in \arg\min_{h \in I_{low}(x)} -\frac{\nabla f(x)_h}{y_h} \right\}.$$

If $x$ is not a solution of problem (1), then $(i_{MVP}, j_{MVP}) \in I^{MVP}_{up}(x) \times I^{MVP}_{low}(x)$
is a pair, possibly not unique, that violates the KKT conditions the most and
it is called Most Violating Pair (MVP). In the sequel, for the sake of notational
simplicity, we assume that, for every feasible $x$, the MVP is unique as this makes
no difference in our analysis.

The direction $d_{MVP} \in \mathbb{R}^n$ corresponding to the pair $(i_{MVP}, j_{MVP}) \in I^{MVP}_{up}(x) \times I^{MVP}_{low}(x)$
is, among all feasible descent directions with only two nonzero components,
the steepest descent one at $x$.

Now we are ready to introduce the definition of “most violating step”. Let
$x_{MVP} = x + \alpha_{MVP}\, d_{MVP}$ with $\alpha_{MVP}$ obtained by (8) with $i = i_{MVP}$, $j = j_{MVP}$.

Definition 1 (Most Violating Step) At a point $x \in \mathcal{F}$, we define the “Most
Violating Step” (MVS) $S_{MVP}$ as

$$S_{MVP}(x) := \|x_{MVP} - x\| = |\alpha_{MVP}|\, \|d_{MVP}\|. \tag{9}$$

In particular, since $y_i \in \{-1,1\}$ we have that $S_{MVP}(x) = |\alpha_{MVP}| \sqrt{2}$.
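A possible way to evaluate the most violating step numerically is sketched below. The function name and the toy data are hypothetical, and returning zero when one of the index sets is empty or when the selected pair is not violating is a convention adopted in this sketch.

```python
import math

def most_violating_step(Q, g, x, y, C):
    """Return (i_mvp, j_mvp, S_mvp) at x, where g is the gradient of f at x."""
    n = len(x)
    up = [r for r in range(n)
          if (x[r] < C and y[r] == 1) or (x[r] > 0 and y[r] == -1)]
    low = [r for r in range(n)
           if (x[r] < C and y[r] == -1) or (x[r] > 0 and y[r] == 1)]
    if not up or not low:
        return None, None, 0.0                    # convention of this sketch
    i = max(up, key=lambda r: -g[r] / y[r])       # element of I_up^MVP
    j = min(low, key=lambda r: -g[r] / y[r])      # element of I_low^MVP
    d_i, d_j = 1.0 / y[i], -1.0 / y[j]
    slope = g[i] * d_i + g[j] * d_j
    if slope >= 0.0:
        return i, j, 0.0                          # no violating pair exists
    beta = min(x[i] if d_i < 0 else C - x[i],
               x[j] if d_j < 0 else C - x[j])
    curvature = Q[i][i] + Q[j][j] - 2.0 * y[i] * y[j] * Q[i][j]
    alpha = beta if curvature <= 0.0 else min(-slope / curvature, beta)
    return i, j, abs(alpha) * math.sqrt(2.0)      # formula (9)

# Hypothetical toy problem.
Q = [[1.0, -0.5], [-0.5, 1.0]]
y = [1, -1]
i0, j0, s0 = most_violating_step(Q, [-1.0, -1.0], [0.0, 0.0], y, 10.0)
_, _, s_opt = most_violating_step(Q, [0.0, 0.0], [2.0, 2.0], y, 10.0)
```

On this toy problem the step is positive at the starting point and vanishes at the optimal point, in agreement with the optimality characterization that follows.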

We can state the optimality condition using the definition of MVS.

Proposition 1 A point $x^* \in \mathcal{F}$ is optimal for problem (1) if and only if either
$I_{up}(x^*) = \emptyset$ or $I_{low}(x^*) = \emptyset$ or $S_{MVP}(x^*) = 0$.

Proof. 1 As said above $x^*$ is optimal for problem (1) if and only if either
$I_{up}(x^*) = \emptyset$ or $I_{low}(x^*) = \emptyset$ or condition (3) holds. Therefore we only have
to show that, in the case in which $I_{up}(x^*) \ne \emptyset$ and $I_{low}(x^*) \ne \emptyset$, the following
holds:

$$S_{MVP}(x^*) = 0 \iff m(x^*) \le M(x^*).$$

Since $I_{up}(x^*) \ne \emptyset$ and $I_{low}(x^*) \ne \emptyset$, we can compute a pair $(i_{MVP}, j_{MVP}) \in
I^{MVP}_{up}(x^*) \times I^{MVP}_{low}(x^*)$ and $x^*_{MVP} = x^* + \alpha^*_{MVP}\, d^*_{MVP}$ as defined above. $m(x^*) \le M(x^*)$ is
equivalent to inequality $\nabla f(x^*)^T d^*_{MVP} \ge 0$. By noting that $d^*_{MVP}$ is a feasible
direction at $x^*$, then from (6) we have $\beta > 0$. Therefore by (8) we can
conclude that $\nabla f(x^*)^T d^*_{MVP} \ge 0$ if and only if $\alpha^*_{MVP} = 0$ and, in turn, if and only if
$S_{MVP}(x^*) = 0$, so that the proof is complete. ■


3 A Parallel Decomposition Model 

In this section we introduce a parallel decomposition scheme for finding a
solution of problem (1). The theoretical properties and implementation details are
discussed in the next sections. The algorithm fits in a decomposition framework
where, as usual, the solution of problem (1) is obtained by solving a sequence
of smaller problems in which only a subset of the variables is changed. To
fix notation, let $x^k \in \mathcal{F}$ and consider a subset $P_i \subset \{1,\dots,n\}$, so that $x^k$ can
be partitioned as $x^k := (x^k_{P_i}, x^k_{-P_i})$. The problem of minimizing over $x_{P_i}$ with
$x_{-P_i}$ fixed to the current value $x^k_{-P_i}$ is:

$$\min_{x_{P_i} \in \mathcal{F}^k_{P_i}} \ f_{P_i}(x_{P_i}, x^k_{-P_i}) + \frac{\tau_i^k}{2} \|x_{P_i} - x^k_{P_i}\|^2, \tag{10}$$

where a proximal point term with $\tau_i^k \ge 0$ has been added [20], and the feasible
set is

$$\mathcal{F}^k_{P_i} := \{x_{P_i} \in \mathbb{R}^{|P_i|} : y_{P_i}^T x_{P_i} = y_{P_i}^T x^k_{P_i},\ 0 \le x_{P_i} \le C e_{P_i}\}.$$

Problem (10) is still quadratic and convex with Hessian matrix $Q_{P_iP_i} + \tau_i^k I_{P_i}$
symmetric and positive semidefinite, and linear term given by $\ell^k_{P_i} - \tau_i^k x^k_{P_i}$, where

$$\ell^k_{P_i} = \sum_{j \ne i} Q_{P_iP_j} x^k_{P_j} - e_{P_i}.$$

We denote by $\hat{x}^k_{P_i}$ a solution of problem (10), which is unique either if $\tau_i^k > 0$
or if $Q_{P_iP_i}$ is positive definite.

The parallel scheme that we are going to define is not based on splitting
the data set or on parallelizing the linear algebra, but on defining a set
of subproblems to be solved by means of parallel and independent processes.
Unlike sequential decomposition methods, the search direction is obtained
by summing up smaller directions obtained by solving in parallel a set of
subproblems of type (10).

Let us define a partition $\mathcal{P} = \{P_1, P_2, \dots\}$ of the set of all indices $\{1,\dots,n\}$.
By definition we have that $P_i \cap P_j = \emptyset$ for all $i \ne j$ and $\cup_i P_i = \{1,\dots,n\}$. The basic idea
underlying the definition of the parallel decomposition algorithm is summarized
in the scheme below.


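Algorithm 1 Parallel Decomposition Model
Initialization Choose $x^0 \in \mathcal{F}$ and set $k = 0$.

Do while (a stopping criterion is not satisfied)

S.1 (Partition definition)
    Set $\mathcal{P}^k = \{P_1, P_2, \dots, P_{N^k}\}$ and set $\tau_i^k \ge 0$ for all $i = 1,\dots,N^k$.

S.2 (Blocks selection)
    Choose a subset of blocks $\mathcal{J}^k \subseteq \mathcal{P}^k$.

S.3 (Parallel computation)
    For all $P_i \in \mathcal{J}^k$ compute in parallel the optimal solution
    $\hat{x}^k_{P_i}$ of problem (10).

S.4 (Direction) Set $d^k \in \mathbb{R}^n$ block-wise as

$$d^k_{P_i} = \begin{cases} \hat{x}^k_{P_i} - x^k_{P_i} & \text{if } P_i \in \mathcal{J}^k \\ 0 & \text{otherwise.} \end{cases} \tag{11}$$

S.5 (Stepsize) Choose a suitable stepsize $\alpha^k > 0$.

S.6 (Update) Set $x^{k+1} = x^k + \alpha^k d^k$ and $k = k + 1$.

End While
Return $x^k$.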


The scheme above encompasses different possible algorithms depending on the
choice of the partition $\mathcal{P}^k$ at S.1, the blocks selection at S.2 and the stepsize
rule at S.5.
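One simple instance of the scheme can be sketched in Python under several assumptions that are not prescribed by the text: every block has exactly two indices, so that problem (10) has a closed-form solution; the most violating pair is always used as one block, so that progress is possible at every iteration, and the remaining indices are paired at random; all blocks are selected at S.2; the stepsize at S.5 is the exact minimizer along the direction, capped at 1; `tau` must be positive. Function names, default values and the use of a thread pool (which only illustrates that the subproblems are independent) are hypothetical choices.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def solve_pair_block(block, Q, g, x, y, C, tau):
    """Closed-form solution of subproblem (10) for a two-index block.

    The feasible segment is x_a + t*ca, x_b + t*cb. Returns the block
    direction x_hat - x^k as a dict {index: value}. Requires tau > 0.
    """
    a, b = block
    ca, cb = float(y[a]), -float(y[b])
    slope = g[a] * ca + g[b] * cb
    curv = Q[a][a] + Q[b][b] - 2.0 * y[a] * y[b] * Q[a][b] + 2.0 * tau
    t_lo, t_hi = -float("inf"), float("inf")
    for r, c in ((a, ca), (b, cb)):               # box constraints on t
        lo, hi = (-x[r], C - x[r]) if c > 0 else (x[r] - C, x[r])
        t_lo, t_hi = max(t_lo, lo), min(t_hi, hi)
    t = min(max(-slope / curv, t_lo), t_hi)
    return {a: t * ca, b: t * cb}

def parallel_decomposition(Q, y, C, tau=1e-3, eta=1e-6, max_iter=1000, seed=0):
    n = len(y)
    rng = random.Random(seed)
    x, g = [0.0] * n, [-1.0] * n                  # x^0 = 0, grad f(x^0) = -e
    with ThreadPoolExecutor() as pool:
        for _ in range(max_iter):
            up = [r for r in range(n)
                  if (x[r] < C and y[r] == 1) or (x[r] > 0 and y[r] == -1)]
            low = [r for r in range(n)
                   if (x[r] < C and y[r] == -1) or (x[r] > 0 and y[r] == 1)]
            if not up or not low:
                break
            i = max(up, key=lambda r: -g[r] / y[r])
            j = min(low, key=lambda r: -g[r] / y[r])
            if -g[i] / y[i] <= -g[j] / y[j] + eta:    # stopping criterion
                break
            # S.1-S.2: MVP block plus random pairs of the remaining indices
            rest = [r for r in range(n) if r != i and r != j]
            rng.shuffle(rest)
            blocks = [(i, j)] + [(rest[p], rest[p + 1])
                                 for p in range(0, len(rest) - 1, 2)]
            # S.3: independent subproblems
            parts = list(pool.map(
                lambda blk: solve_pair_block(blk, Q, g, x, y, C, tau), blocks))
            # S.4: assemble the direction block-wise
            d = [0.0] * n
            for part in parts:
                for r, v in part.items():
                    d[r] = v
            # S.5: exact minimization along d, capped at 1
            Qd = [sum(Q[r][q] * d[q] for q in range(n)) for r in range(n)]
            slope = sum(g[r] * d[r] for r in range(n))
            curv = sum(d[r] * Qd[r] for r in range(n))
            alpha = 1.0 if curv <= 0.0 else min(1.0, -slope / curv)
            # S.6: update the point and the gradient
            for r in range(n):
                x[r] = min(max(x[r] + alpha * d[r], 0.0), C)  # guards round-off
                g[r] += alpha * Qd[r]
    return x

# Hypothetical toy problem whose solution is x = (2, 2).
x_star = parallel_decomposition([[1.0, -0.5], [-0.5, 1.0]], [1, -1], C=10.0)
```

Capping the stepsize at 1 keeps the new point inside the feasible set, since each block solution is feasible and the feasible set is convex.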
A widely used standard feasible point is $x^0 = 0$, but of course different
choices are possible if available. The choice of $x^0 = 0$ presents the advantage
that also the gradient is available, being $\nabla f(x^0) = -e$.

Checking optimality of the current point $x^k$ may require zero or first order
information depending on the stopping criterion adopted. A standard stopping
criterion is based on checking condition $m(x^k) \le M(x^k) + \eta$, for a given tolerance
$\eta > 0$. In this case the updated gradient $\nabla f(x^{k+1})$ is needed at each iteration.
It is well known that for large scale problems this is a big effort due to expensive
kernel evaluations. Indeed we have the following iterative updating rule

$$\nabla f(x^{k+1}) = \nabla f(x^k) + \alpha^k \sum_{P_i \in \mathcal{J}^k} \sum_{h \in P_i} Q_{*h}\, d^k_h.$$
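The iterative gradient update can be sketched as follows; only the columns of Q associated with nonzero components of the direction are touched, which is where the kernel evaluations are spent. The function name and the dense storage of Q are assumptions made for illustration.

```python
def update_gradient(g, Q, d, alpha):
    """Return grad f(x + alpha*d) from g = grad f(x), using only the
    columns Q[:, h] for which d[h] is nonzero."""
    new_g = list(g)
    for h, d_h in enumerate(d):
        if d_h != 0.0:
            for r in range(len(g)):
                new_g[r] += alpha * Q[r][h] * d_h
    return new_g

# Hypothetical toy data: one full step from x = 0 to x = (2, 2).
Q = [[1.0, -0.5], [-0.5, 1.0]]
g_new = update_gradient([-1.0, -1.0], Q, [1.0, 1.0], 2.0)
```

In a large scale setting the columns would typically be computed on demand from the kernel rather than read from a stored matrix.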

At S.1 a partition $\mathcal{P}^k$ of $\{1,\dots,n\}$ is defined. We point out that both the
number of blocks $N^k$ and their composition in $\mathcal{P}^k$ can vary from one iteration to
another. For notational simplicity we omit the dependency of blocks $P_1, P_2, \dots, P_{N^k}$
on the iteration $k$. As usual in decomposition algorithms, a correct choice of
the partition is crucial for proving global convergence of the method.

At S.2 a subset $\mathcal{J}^k$ of blocks in $\mathcal{P}^k$ is selected. These blocks are the only ones
used at S.3 to compute a search direction according to (11). The selection of
blocks makes the algorithmic scheme more flexible since one can set the overall
computational burden.

At S.3 we obtain an optimal solution $\hat{x}^k_{P_i}$ of problem (10) for each $P_i \in \mathcal{J}^k$.
Note that $\hat{x}^k_{P_i}$ satisfies the optimality condition

$$\left[\nabla_{P_i} f_{P_i}(\hat{x}^k_{P_i}, x^k_{-P_i}) + \tau_i^k (\hat{x}^k_{P_i} - x^k_{P_i})\right]^T d_{P_i} \ge 0 \tag{12}$$

for any feasible direction $d_{P_i}$ at $\hat{x}^k_{P_i}$. The computational burden of this step
consists in

• computing vector $\ell^k_{P_i}$ to construct the objective function of (10) for all
blocks $P_i \in \mathcal{J}^k$,

• solving the $|\mathcal{J}^k|$ subproblems.

These $|\mathcal{J}^k|$ convex quadratic problems can be distributed to different processes
in order to be solved in a parallel fashion.

At S.4 the algorithm computes search direction $d^k$.

At S.5 the algorithm computes stepsize $\alpha^k$. We show in the next section
(Theorems 1, 2 and 3) that, in order to have convergence of the algorithm, $\alpha^k$ can
be computed according to a simple diminishing stepsize rule or a linesearch
procedure (including the exact minimization rule).


4 Theoretical Analysis 

In this section we analyze the theoretical properties of Algorithm 1. To this
aim, we first introduce the definition of descent block and of descent iteration
that are crucial for the following analysis.

Definition 2 (Descent block) Given $\epsilon > 0$. At a feasible point $x^k$, we say
that the block of variables $P_i \subseteq \{1,\dots,n\}$ is a descent block if it satisfies

$$\|\hat{x}^k_{P_i} - x^k_{P_i}\| \ge \epsilon\, S_{MVP}(x^k), \tag{13}$$

where $\hat{x}^k_{P_i}$ is the optimal solution of the corresponding problem (10) and
$S_{MVP}(x^k) = \|x^k_{MVP} - x^k\|$.

Whenever at least one descent block $P_i$ is selected in $\mathcal{J}^k$ at S.2 of Algorithm 1
we say that iteration $k$ is a descent iteration.

Under the assumption that at least one descent block is selected for
optimization at S.3 of the parallel algorithmic model, we will prove that by using a
suitable $\alpha^k$ at S.5 the sequence $\{x^k\}$ produced by the algorithm satisfies

$$\lim_{k \to \infty} S_{MVP}(x^k) = 0. \tag{14}$$

We prove later on that the assumption is easy to achieve. We first consider
the case when the stepsize $\alpha^k$ is determined by a standard Armijo linesearch
procedure along the direction $d^k$.

Theorem 1 Let $\{x^k\}$ be the sequence generated by Algorithm 1, where $\alpha^k \le 1$
at S.5 satisfies the following Armijo condition

$$f(x^k + \alpha^k d^k) \le f(x^k) + \theta \alpha^k \nabla f(x^k)^T d^k \tag{15}$$

with $\theta \in (0,1)$. Assume that for all iterations $k$

(i) a descent iteration $\bar{k}$ with $k \le \bar{k} \le k + L$ for a finite $L \ge 0$ is generated;

(ii) either $\tau_i^k \ge \tau > 0$ or $Q_{P_iP_i} \succ 0$, for all $P_i \in \mathcal{J}^k$.

Then either Algorithm 1 terminates in a finite number of iterations at a solution
of problem (1) or $\{x^k\}$ admits a limit point and it satisfies (14).

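Proof. 2 First of all we note that $\{x^k\}$ is a feasible sequence; in fact it is
sufficient to show that for all $k$ if $x^k$ is feasible then also $x^{k+1}$ is feasible. Since
for all $P_i \in \mathcal{J}^k$ it holds that $y_{P_i}^T \hat{x}^k_{P_i} = y_{P_i}^T x^k_{P_i}$, then we can write

$$y^T x^{k+1} = \sum_{P_i \in \mathcal{J}^k} y_{P_i}^T \left(x^k_{P_i} + \alpha^k (\hat{x}^k_{P_i} - x^k_{P_i})\right) + \sum_{P_i \notin \mathcal{J}^k} y_{P_i}^T x^k_{P_i}
= \sum_{P_i \in \mathcal{J}^k} y_{P_i}^T x^k_{P_i} + \sum_{P_i \notin \mathcal{J}^k} y_{P_i}^T x^k_{P_i} = y^T x^k = 0.$$

Finally by noting that for all $P_i \in \mathcal{J}^k$ it holds that $0 \le \hat{x}^k_{P_i} \le C e_{P_i}$, then by the
convexity of the box constraints and since $\alpha^k \le 1$, we obtain $0 \le x^{k+1} \le Ce$ and
this proves the feasibility of the sequence.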

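Note that for all $P_i \in \mathcal{J}^k$ the direction $-d^k_{P_i}$ is a feasible direction at $\hat{x}^k_{P_i}$, hence
by (12) we can write

$$-\left[\nabla_{P_i} f_{P_i}(\hat{x}^k_{P_i}, x^k_{-P_i}) + \tau_i^k (\hat{x}^k_{P_i} - x^k_{P_i})\right]^T d^k_{P_i} \ge 0. \tag{16}$$

Hence it holds that

$$\begin{aligned}
-\nabla_{P_i} f_{P_i}(x^k_{P_i}, x^k_{-P_i})^T d^k_{P_i}
&= (Q_{P_iP_i} x^k_{P_i} + \ell^k_{P_i})^T (x^k_{P_i} - \hat{x}^k_{P_i}) \\
&= (Q_{P_iP_i} \hat{x}^k_{P_i} + \ell^k_{P_i})^T (x^k_{P_i} - \hat{x}^k_{P_i}) + (\hat{x}^k_{P_i} - x^k_{P_i})^T Q_{P_iP_i} (\hat{x}^k_{P_i} - x^k_{P_i}) \\
&= -\left[\nabla_{P_i} f_{P_i}(\hat{x}^k_{P_i}, x^k_{-P_i}) + \tau_i^k (\hat{x}^k_{P_i} - x^k_{P_i})\right]^T d^k_{P_i} + \tau_i^k \|\hat{x}^k_{P_i} - x^k_{P_i}\|^2 \\
&\qquad + (\hat{x}^k_{P_i} - x^k_{P_i})^T Q_{P_iP_i} (\hat{x}^k_{P_i} - x^k_{P_i}) \\
&\ge \tau_i^k \|\hat{x}^k_{P_i} - x^k_{P_i}\|^2 + (\hat{x}^k_{P_i} - x^k_{P_i})^T Q_{P_iP_i} (\hat{x}^k_{P_i} - x^k_{P_i}) \qquad \text{(by (16))} \\
&\ge \left(\tau_i^k + \lambda^{Q_{P_iP_i}}_{\min}\right) \|d^k_{P_i}\|^2,
\end{aligned}$$

and then we have

$$\nabla_{P_i} f_{P_i}(x^k_{P_i}, x^k_{-P_i})^T d^k_{P_i} \le -\left(\tau_i^k + \lambda^{Q_{P_iP_i}}_{\min}\right) \|d^k_{P_i}\|^2. \tag{17}$$

We denote by $\rho^k = \min_{P_i \in \mathcal{J}^k} \left(\tau_i^k + \lambda^{Q_{P_iP_i}}_{\min}\right) > 0$. By assumption (ii) there exists
$\rho > 0$ such that $\rho^k \ge \rho$ for all $k$.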

From conditions (15) and (17) we can write

$$f(x^{k+1}) - f(x^k) \le -\theta \rho\, \alpha^k \|d^k\|^2, \tag{18}$$

therefore sequence $\{f(x^k)\}$ is decreasing. It is also bounded below so that it
converges and

$$\lim_{k \to \infty} \left(f(x^{k+1}) - f(x^k)\right) = 0. \tag{19}$$

Let $\bar{x}$ be a limit point of $\{x^k\}$; at least one such point exists, $\mathcal{F}$ being compact.
Since, by the compactness of $\mathcal{F}$ and by the continuity of $f$, $-\infty < f(\bar{x}) - f(x^0)$
for all $x^0 \in \mathcal{F}$, then, by (18) and (19), we can write

$$\sum_{k=0}^{\infty} \alpha^k \|d^k\|^2 < +\infty. \tag{20}$$

By condition (i) we can define an infinite subsequence $\{k\}_K$ made up of only
descent iterations. Then, by (20), it follows that

$$\sum_{k=0,\, k \in K}^{\infty} \alpha^k \|d^k\|^2 < +\infty. \tag{21}$$


A standard Armijo linesearch satisfying (15) produces at each iteration an $\alpha^k > 0$
in a finite number of steps (see W) and this ensures that

$$\sum_{k=0,\, k \in K}^{\infty} \alpha^k = +\infty. \tag{22}$$

By (21) and (22), we obtain

$$\liminf_{k \to \infty,\, k \in K} \|d^k\| = 0.$$

Now since each $\mathcal{J}^k$ with $k \in K$ contains a descent block at $x^k$, by (13) we can
conclude that

$$\liminf_{k \to \infty,\, k \in K} S_{MVP}(x^k) = 0,$$

and then, since $S_{MVP}(x^k) \ge 0$ for all $k$, we can write

$$\liminf_{k \to \infty} S_{MVP}(x^k) = 0. \tag{23}$$

Suppose by contradiction that $\limsup_{k \to \infty} S_{MVP}(x^k) > 0$, then for any $\gamma > 0$
sufficiently small we would have $S_{MVP}(x^k) > \gamma$ for infinitely many $k$ and $S_{MVP}(x^k) < \gamma/2$
for infinitely many $k$. Therefore, one can always find an infinite set of indices,
say $\mathcal{N}$, having the following property: for any $n \in \mathcal{N}$, there exists an integer
$i_n > n$ such that $S_{MVP}(x^n) < \gamma/2$ and $S_{MVP}(x^{i_n}) > \gamma$. Then it is easy to see that
$x^n \ne x^{i_n}$ and then $\sum_{k=n}^{i_n - 1} \alpha^k \|d^k\| > 0$ for all $n \in \mathcal{N}$. And then

$$\liminf_{n \in \mathcal{N},\, n \to \infty} \sum_{k=n}^{i_n - 1} \alpha^k \|d^k\| > 0,$$

which is in contradiction with (20). Then we finally obtain (14).


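Note that since $f$ is quadratic, condition (15) can be guaranteed by using
the exact minimization rule along direction $d^k$:

$$\alpha^k = \max\left\{ \min\left\{ -\frac{\nabla f(x^k)^T d^k}{{d^k}^T Q\, d^k},\ \bar{\alpha} \right\},\ 0 \right\}, \tag{24}$$

where

$$\bar{\alpha} := \min_{r\,:\, d^k_r \ne 0} \begin{cases} \dfrac{x^k_r}{-d^k_r} & \text{if } d^k_r < 0 \\[6pt] \dfrac{C - x^k_r}{d^k_r} & \text{if } d^k_r > 0. \end{cases}$$

Note that in this case it is not necessary to impose $\alpha^k \le 1$ since the feasibility
is guaranteed by construction.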

If $n$ is huge it may be cheaper to compute $\alpha^k$ in an alternative way in order to
save costly function evaluations. In particular we propose a diminishing stepsize
strategy for which we give two different convergence results based on slightly
different hypotheses.


Theorem 2 Let $\{x^k\}$ be the sequence generated by Algorithm 1, where $\alpha^k \in
(0,1]$ at S.5 satisfies the following condition

$$\alpha^k \to 0 \quad \text{and} \quad \sum_{k=0}^{\infty} \alpha^k = +\infty. \tag{25}$$

Assume that for all iterations $k$

(i) $k$ is a descent iteration;

(ii) either $\tau_i^k \ge \tau > 0$ or $Q_{P_iP_i} \succ 0$, for all $P_i \in \mathcal{J}^k$.

Then either Algorithm 1 converges in a finite number of iterations to a solution
of problem (1) or $\{x^k\}$ admits a limit point and (14) holds.

Proof. 3 Note that feasibility of the sequence $\{x^k\}$ and inequality (17) hold,
see the proof of Theorem 1.

For any given $k \ge 0$ we can write (Descent Lemma)

$$f(x^{k+1}) - f(x^k) \le \alpha^k \nabla f(x^k)^T d^k + \frac{(\alpha^k)^2 \lambda^Q_{\max}}{2} \|d^k\|^2. \tag{26}$$

By using (ii) and (17) we can rewrite inequality (26):

$$f(x^{k+1}) - f(x^k) \le -\alpha^k \left(\rho - \frac{\alpha^k \lambda^Q_{\max}}{2}\right) \|d^k\|^2, \tag{27}$$

where $\rho > 0$ is the minimum among $\tau$ and all the minimum eigenvalues of all the
positive definite principal submatrices of $Q$. Since, by (25), $\alpha^k \to 0$ it follows
that there exist $\bar{\rho} > 0$ and $\bar{k}$ sufficiently large such that for all $k \ge \bar{k}$ inequality
(27) implies:

$$f(x^{k+1}) - f(x^k) \le -\alpha^k \bar{\rho}\, \|d^k\|^2.$$

Since, as said in the proof of Theorem 1, $-\infty < f(\bar{x}) - f(x^{\bar{k}})$ for all $x^{\bar{k}} \in \mathcal{F}$
and any limit point $\bar{x}$ of $\{x^k\}$, in a similar way we can write

$$\sum_{k=\bar{k}}^{\infty} \alpha^k \|d^k\|^2 < +\infty. \tag{28}$$

By (25), $\sum_{k=\bar{k}}^{\infty} \alpha^k = +\infty$, and then we obtain

$$\liminf_{k \to \infty} \|d^k\| = 0.$$

Now since each $\mathcal{J}^k$ contains a descent block at $x^k$, by (13) we can conclude that

$$\liminf_{k \to \infty} S_{MVP}(x^k) = 0,$$

and the thesis follows by the same reasoning as in the proof of Theorem 1. ■

As stated in Theorem 2, a diminishing stepsize rule requires all iterations
to be descent. In certain applications (e.g. when variables are randomly
partitioned), it could be useful to relax this condition, requiring only that a
subsequence of the iterations is descent, as in Theorem 1. This is formalized
in the next theorem where we assume the additional mild hypothesis of
monotonicity of sequence $\{\alpha^k\}$.
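As a small illustration, one stepsize sequence that lies in (0, 1], is nonincreasing, tends to zero and has a divergent sum is the harmonic-type rule sketched below. The specific formula and the parameter names `a0` and `decay` are illustrative assumptions, not a rule prescribed by the text.

```python
def diminishing_stepsize(k, a0=1.0, decay=0.1):
    """Return alpha^k = a0 / (1 + decay * k), with 0 < a0 <= 1 and decay > 0.

    The sequence is nonincreasing, converges to zero, and its partial sums
    grow like (a0 / decay) * log(k), so the series diverges.
    """
    return a0 / (1.0 + decay * k)

steps = [diminishing_stepsize(k) for k in range(1000)]
```

A slower decay keeps the steps larger for longer at the price of a later onset of the guaranteed decrease in the objective.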

Theorem 3 Let $\{x^k\}$ be the sequence generated by Algorithm 1, where $\alpha^k \in
(0,1]$ at S.5 satisfies (25). Assume that for all iterations $k$

(i) a descent iteration $\bar{k}$ with $k \le \bar{k} \le k + L$ for a finite $L \ge 0$ is generated;

(ii) either $\tau_i^k \ge \tau > 0$ or $Q_{P_iP_i} \succ 0$, for all $P_i \in \mathcal{J}^k$;

(iii) $\alpha^{k+1} \le \alpha^k$.

Then either Algorithm 1 converges in a finite number of iterations to a solution
of problem (1) or $\{x^k\}$ admits a limit point and (14) holds.

Proof. 4 Following the same reasoning of Theorem 2, inequality (28) holds.
By condition (i) we can define an infinite subsequence $\{k\}_K$ containing only
descent iterations. Then by (28) it follows that

$$\sum_{k \ge \bar{k},\, k \in K}^{\infty} \alpha^k \|d^k\|^2 < +\infty. \tag{29}$$

We can write the following chain of inequalities

$$+\infty = \sum_{k=\bar{k}+1}^{\infty} \sum_{h=0}^{L-1} \alpha^{L \cdot k + h} \le L \sum_{k=\bar{k}+1}^{\infty} \alpha^{L \cdot k} \le L \sum_{k \ge \bar{k},\, k \in K}^{\infty} \alpha^k,$$

where the equality is due to (25), the first inequality holds by (iii) and the second
inequality holds by (i) and (iii). Then the thesis follows by the same reasoning
as in the proof of Theorem 1. ■


5 Construction of the Partitions 


To make the results stated in the previous section of practical interest, the
major difficulty is to ensure that an iteration is descent in the sense that at
least one descent block, according to Definition 2, is selected. The next lemma gives
a relation between the step length produced by optimizing over a generic block $P_i$
and the one produced by optimizing over any violating pair $(i, j)$ belonging to $P_i$.
This result will be useful in order to practically build a descent block and it is
used in Theorem 4. In this section we use the simplifying assumption that any
principal submatrix of $Q$ of order 2 is positive definite.

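Lemma 1 Assume that any principal submatrix of $Q$ of order 2 is positive
definite. Let $x^k$ be a feasible point for problem (1) and let $(i,j) \in I_{up}(x^k) \times
I_{low}(x^k)$. Suppose that a block $P_i \subseteq \{1,\dots,n\}$ exists such that $\{i,j\} \subseteq P_i$. Let
$\bar{x}^k$ be the unique solution of

$$\min_{x_{(i,j)} \in \mathcal{F}^k_{(i,j)}} \ f\left(x_{(i,j)},\ x^k_{\{1,\dots,n\} \setminus (i,j)}\right). \tag{30}$$

Then there exists a scalar $\epsilon > 0$ such that

$$\|\hat{x}^k_{P_i} - x^k_{P_i}\| \ge \epsilon\, \|\bar{x}^k - x^k\|, \tag{31}$$

where $\hat{x}^k_{P_i}$ is a solution of problem (10).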
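Proof. 5 First we note that

$$\bar{x}^k = x^k + \alpha\, d^{(i,j)},$$

with $d^{(i,j)}$ defined in (5) and $\alpha$ computed as in (8). For the sake of notation let
us set $d = d^{(i,j)}$.

Since $\{i,j\} \subseteq P_i$, two cases are possible: (a) $\hat{x}^k_{P_i} + \mu d_{P_i} \notin \mathcal{F}^k_{P_i}$ for all $\mu > 0$,
or (b) $\mu > 0$ exists such that $\hat{x}^k_{P_i} + \mu d_{P_i} \in \mathcal{F}^k_{P_i}$, that is $d_{P_i}$ is a feasible direction
at $\hat{x}^k_{P_i}$.

(a) By construction it holds that $y_{P_i}^T d_{P_i} = 0$. Then it holds that for all $\mu > 0$:

$$y_{P_i}^T (\hat{x}^k_{P_i} + \mu d_{P_i}) = y_{P_i}^T \hat{x}^k_{P_i} = y_{P_i}^T x^k_{P_i},$$

where the last equality holds since $\hat{x}^k_{P_i} \in \mathcal{F}^k_{P_i}$. Therefore we can conclude that
for all $\mu > 0$:

$$\hat{x}^k_{P_i} + \mu d_{P_i} \notin \{x_{P_i} : 0 \le x_{P_i} \le C e_{P_i}\},$$

and then either $\hat{x}^k_i$ or $\hat{x}^k_j$ must be on a bound. In particular, supposing
w.l.o.g. that $i$ is the component on the bound, if $d_i > 0$ then $\hat{x}^k_i = C$
and then we can write

$$0 < \alpha d_i = \bar{x}^k_i - x^k_i \le C - x^k_i = \hat{x}^k_i - x^k_i;$$

otherwise $d_i < 0$, then $\hat{x}^k_i = 0$ and then

$$0 > \alpha d_i = \bar{x}^k_i - x^k_i \ge 0 - x^k_i = \hat{x}^k_i - x^k_i.$$

In both cases it holds that

$$|\alpha d_i| \le |\hat{x}^k_i - x^k_i|.$$

Therefore noting that $|\alpha d_i| = |\alpha|$ and that $|\alpha| \sqrt{2} = \|\bar{x}^k - x^k\|$, we can
conclude that

$$\|\hat{x}^k_{P_i} - x^k_{P_i}\| \ge |\hat{x}^k_i - x^k_i| \ge |\alpha| = \frac{1}{\sqrt{2}} \|\bar{x}^k - x^k\|.$$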


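(b) Since $d_{P_i}$ is a feasible direction at $\hat{x}^k_{P_i}$ and since $\hat{x}^k_{P_i}$ is an optimal solution
of problem (10) at $x^k$, by (12), we can write

$$\left[\nabla_{P_i} f_{P_i}(\hat{x}^k_{P_i}, x^k_{-P_i}) + \tau_i^k (\hat{x}^k_{P_i} - x^k_{P_i})\right]^T d_{P_i} \ge 0. \tag{32}$$

Since $\bar{x}^k$ is a solution of (30), and $-d$ being a feasible direction for (30) at
$\bar{x}^k$, then, by the minimum principle and since $\{i,j\} \subseteq P_i$, we can write

$$\nabla_{P_i} f_{P_i}(\bar{x}^k_{P_i}, x^k_{-P_i})^T d_{P_i} \le 0.$$

And therefore by (32) we can write

$$\left[\nabla_{P_i} f_{P_i}(\hat{x}^k_{P_i}, x^k_{-P_i}) + \tau_i^k (\hat{x}^k_{P_i} - x^k_{P_i})\right]^T d_{P_i} \ge \nabla_{P_i} f_{P_i}(\bar{x}^k_{P_i}, x^k_{-P_i})^T d_{P_i}. \tag{33}$$

By assumptions, $\sigma > 0$ exists such that

$$\sigma \|\bar{x}^k_{P_i} - x^k_{P_i}\|^2 \le (\bar{x}^k_{P_i} - x^k_{P_i})^T Q_{P_iP_i} (\bar{x}^k_{P_i} - x^k_{P_i})
= \left[\nabla_{P_i} f_{P_i}(\bar{x}^k_{P_i}, x^k_{-P_i}) - \nabla_{P_i} f_{P_i}(x^k_{P_i}, x^k_{-P_i})\right]^T \alpha d_{P_i}. \tag{34}$$

Then combining (33) and (34) we can write

$$\begin{aligned}
\sigma \|\bar{x}^k_{P_i} - x^k_{P_i}\|^2
&\le \tau_i^k (\hat{x}^k_{P_i} - x^k_{P_i})^T \alpha d_{P_i} + \left[\nabla_{P_i} f_{P_i}(\hat{x}^k_{P_i}, x^k_{-P_i}) - \nabla_{P_i} f_{P_i}(x^k_{P_i}, x^k_{-P_i})\right]^T \alpha d_{P_i} \\
&\le \left(\tau_i^k + \|Q_{P_iP_i}\|\right) \|\hat{x}^k_{P_i} - x^k_{P_i}\|\, \|\alpha d_{P_i}\| \\
&= \left(\tau_i^k + \lambda^{Q_{P_iP_i}}_{\max}\right) \|\hat{x}^k_{P_i} - x^k_{P_i}\|\, \|\bar{x}^k_{P_i} - x^k_{P_i}\|.
\end{aligned}$$

Therefore we obtain

$$\|\hat{x}^k_{P_i} - x^k_{P_i}\| \ge \frac{\sigma}{\tau_i^k + \lambda^{Q_{P_iP_i}}_{\max}} \|\bar{x}^k_{P_i} - x^k_{P_i}\| = \frac{\sigma}{\tau_i^k + \lambda^{Q_{P_iP_i}}_{\max}} \|\bar{x}^k - x^k\|,$$

and finally we have the proof.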


Theorem 4 Assume that any principal submatrix of $Q$ of order 2 is positive
definite. Let $x^k$ be a feasible point for problem (1) and let $(i,j)$ be a pair of
indices such that $i \in I_{up}(x^k)$ and $j \in I_{low}(x^k)$. Suppose that a block $P_i \subseteq
\{1,\dots,n\}$ exists such that $\{i,j\} \subseteq P_i$. Let $\bar{x}^k$ be the unique solution of problem
(30) and suppose that $\epsilon > 0$ exists such that

$$\|\bar{x}^k - x^k\| \ge \epsilon\, S_{MVP}(x^k). \tag{35}$$

Then $P_i$ is a descent block.

Proof. 6 By Lemma 1 we know that (31) holds. Therefore by combining (31)
and (35) we obtain the proof. ■

Theorem 4 shows that we can build a descent block at the cost of computing
a pair that satisfies (35). Clearly the most violating pair does so, but it is easy
to see that any pair that “sufficiently” violates KKT conditions can be used as
well.

Now we give a further theoretical result which guarantees that at each
iteration of Algorithm 1 at least one descent block can be built.

Theorem 5 Let $x^k$ be a feasible, but not optimal, point for problem (1); then
at least one descent block $P_i \subseteq \{1,\dots,n\}$ exists.

Proof. 7 By Proposition 1 if $x^k$ is not optimal then $I_{up}(x^k) \ne \emptyset$, $I_{low}(x^k) \ne \emptyset$
and $S_{MVP}(x^k) > 0$. Therefore $P_i = \{i_{MVP}, j_{MVP}\}$ is a descent block. ■


6 Global Convergence in a Realistic Setting 

So far we have proved that, under some suitable conditions, Algorithm 1 either
converges in a finite number of iterations to a solution of problem (1) or the
produced sequence $\{x^k\}$ satisfies (14). However the fact that $S_{MVP}(x^k)$ goes to
zero is not enough to guarantee asymptotic convergence of Algorithm 1 to a
solution of problem (1). Indeed, this is due to the discontinuous nature of the
index sets $I_{up}$ and $I_{low}$ that enter the definition of $S_{MVP}(x^k)$. Actually, this is
a well known theoretical issue in decomposition methods for the SVM training
problem. However, even if the algorithm were proved to converge
asymptotically to an optimal solution, the validity of a stopping criterion
based on the KKT conditions must be verified m-. A possible way to sort
out these theoretical issues is to use some theoretical tricks, for example
by properly inserting some standard MVP iterations in the produced sequence
$\{x^k\}$ [15] or by dealing with $\epsilon$-solutions [T3]. All these theoretical efforts can be
encompassed in a realistic numerical setting. Indeed all the papers discussing
decomposition methods rely on the fact that the index sets $I_{up}$ and
$I_{low}$ can be computed in exact arithmetic. In practice what can actually be
computed are the following $\epsilon$-perturbations of sets $I_{up}$ and $I_{low}$

$$I^{\epsilon}_{up}(x) := \{r \in \{1,\dots,n\} : x_r \le C - \epsilon,\ y_r = 1, \text{ or } x_r \ge \epsilon,\ y_r = -1\},$$

$$I^{\epsilon}_{low}(x) := \{r \in \{1,\dots,n\} : x_r \le C - \epsilon,\ y_r = -1, \text{ or } x_r \ge \epsilon,\ y_r = 1\},$$

with $\epsilon > 0$. Consequently we can define at a feasible point $x$ the following
quantities

$$m^{\epsilon}(x) = \max_{r \in I^{\epsilon}_{up}(x)} -\frac{\nabla f(x)_r}{y_r}, \qquad M^{\epsilon}(x) = \min_{r \in I^{\epsilon}_{low}(x)} -\frac{\nabla f(x)_r}{y_r}.$$

As a matter of fact an effective optimality condition which can be used is

$$m^{\epsilon}(x^k) \le M^{\epsilon}(x^k) + \eta, \tag{36}$$

where $\eta > 0$ is a given tolerance. Note that any asymptotically convergent
decomposition algorithm can actually converge only to a point satisfying (36),
rather than (3).
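The perturbed stopping test can be sketched as follows. The function name and the default values of `eps` and `eta` are hypothetical, and treating an empty perturbed index set as termination is a convention adopted in this sketch.

```python
def eps_stopping_test(g, x, y, C, eps=1e-8, eta=1e-3):
    """Return True when the eps-perturbed optimality condition (36) holds.

    g is the gradient of f at x. An empty perturbed index set is treated
    as termination (a convention of this sketch).
    """
    n = len(x)
    up = [r for r in range(n)
          if (x[r] <= C - eps and y[r] == 1) or (x[r] >= eps and y[r] == -1)]
    low = [r for r in range(n)
           if (x[r] <= C - eps and y[r] == -1) or (x[r] >= eps and y[r] == 1)]
    if not up or not low:
        return True
    m_eps = max(-g[r] / y[r] for r in up)
    M_eps = min(-g[r] / y[r] for r in low)
    return m_eps <= M_eps + eta

# Hypothetical toy checks with two variables and opposite labels.
y = [1, -1]
stop_at_zero = eps_stopping_test([-1.0, -1.0], [0.0, 0.0], y, C=10.0)
stop_at_opt = eps_stopping_test([0.0, 0.0], [2.0, 2.0], y, C=10.0)
```

Components that are within `eps` of a bound are treated as if they were exactly on it, which removes the dependence of the test on values that cannot be distinguished in floating point arithmetic.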

It is easy to see that /up(x) C /^^(x) and — how{^) for all e > 0. 

Furthemore in [^, it has been proved the following result. 

Proposition 2 Let be a sequence of feasible points converging to a point 
X € T. Then, there exists a scalar e > 0 (depending only on x) such that for 
every e G ( 0 , e] there exists an index k = for which 

/^p(x'') =/„p(x'') and = how{^'") forallk>k. 

This proposition allows us to state that, for $k$ sufficiently large and $\epsilon$ sufficiently
small, using the index sets $I_{up}^{\epsilon}(x)$ and $I_{low}^{\epsilon}(x)$ is equivalent to using the exact ones $I_{up}$
and $I_{low}$, and we have that

$$m^{\epsilon}(x) = m(x) \quad \text{and} \quad M^{\epsilon}(x) = M(x),$$

so that, for any $\epsilon \in (0, \bar{\epsilon}]$ and $k \ge \bar{k}$, (36) reduces to the concept of $\eta$-optimal
solution introduced in M-. However, this is not true far from a solution and/or
for a wrong value of $\epsilon$, since $\bar{\epsilon}$ is unknown. Reducing $\epsilon$ to the machine precision
$\epsilon_{mach}$ is the best that we can do in a numerical implementation, so that one
can argue that, for $\epsilon = \epsilon_{mach}$, if $I_{up}^{\epsilon}(x) = \emptyset$ or $I_{low}^{\epsilon}(x) = \emptyset$, a solution has been
reached within the possible tolerance.
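The perturbed index sets and the stopping test (36) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' MATLAB code: the function names, the use of NumPy, and the convention of returning $-\infty$ or $+\infty$ when a set is empty (so that the test is then declared satisfied, in line with the remark above) are assumptions.

```python
import numpy as np

def eps_index_sets(x, y, C, eps):
    """Boolean masks of the eps-perturbed sets I_up^eps(x) and I_low^eps(x)."""
    up = ((x < C - eps) & (y == 1)) | ((x > eps) & (y == -1))
    low = ((x < C - eps) & (y == -1)) | ((x > eps) & (y == 1))
    return up, low

def eps_violation(x, grad, y, C, eps):
    """Return (m_eps, M_eps) computed over the perturbed index sets."""
    up, low = eps_index_sets(x, y, C, eps)
    score = -grad / y  # y_r is +1 or -1, so this equals -y_r * grad_r
    # Assumed convention: an empty set gives -inf / +inf, so the test passes.
    m_eps = score[up].max() if up.any() else -np.inf
    M_eps = score[low].min() if low.any() else np.inf
    return m_eps, M_eps

def is_eta_optimal(x, grad, y, C, eps, eta):
    """Stopping test (36): m_eps(x) <= M_eps(x) + eta."""
    m_eps, M_eps = eps_violation(x, grad, y, C, eps)
    return bool(m_eps <= M_eps + eta)

# Toy data with Q = I, so that grad f(x) = x - e.
y = np.array([1.0, 1.0, -1.0, -1.0])
C, eps, eta = 1.0, 1e-8, 1e-3
x0, x_star = np.zeros(4), np.ones(4)
print(is_eta_optimal(x0, x0 - 1.0, y, C, eps, eta))          # test at the origin
print(is_eta_optimal(x_star, x_star - 1.0, y, C, eps, eta))  # test at x = e
```

On this toy problem the test fails at the origin and holds at $x = e$, where the gradient vanishes.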

Given a point $x^k \in \mathcal{F}$, we consider the MVP $\epsilon$-step $S^{\epsilon}_{MVP}(x^k)$ obtained by
using $I_{up}^{\epsilon}(x^k)$ and $I_{low}^{\epsilon}(x^k)$ instead of $I_{up}(x^k)$ and $I_{low}(x^k)$. As a consequence
of the definition itself, for any MVP $\epsilon$-direction we get that the corresponding
feasible $\epsilon$-stepsize remains bounded away from zero by $\epsilon$.

It is easy to see that all the results stated so far for Algorithm 1 are still valid
if we consider the $\epsilon$-definition $S^{\epsilon}_{MVP}(x^k)$ rather than $S_{MVP}(x^k)$. Furthermore, we
have the following result, which fills the convergence gap.
Theorem 6 Let $\epsilon > 0$ and $\eta > 0$ be given. Let $\{x^k\}$ be a sequence of feasible
points such that $I_{up}^{\epsilon}(x^k) \neq \emptyset$, $I_{low}^{\epsilon}(x^k) \neq \emptyset$ and

$$\lim_{k \to \infty} S^{\epsilon}_{MVP}(x^k) = 0.$$

Then $\bar{k} \ge 0$ exists such that, for all $k \ge \bar{k}$, $x^k$ satisfies (36).
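Proof. By definition of $S^{\epsilon}_{MVP}(x^k)$ we get

$$0 = \lim_{k \to \infty} S^{\epsilon}_{MVP}(x^k) = \sqrt{2} \lim_{k \to \infty} |\alpha^{\epsilon,k}_{MVP}|. \qquad (37)$$

Since by construction $\beta^k \ge \epsilon$, we get that (37) implies that $\bar{k} \ge 0$ exists
such that, for all $k \ge \bar{k}$, we have $-\nabla f(x^k)^T d^{\epsilon,k}_{MVP} \le \eta$, which implies (36). ■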

7 Practical Algorithmic Realizations 

Algorithm 1 includes a vast number of specific strategies that may vary
according to several implementation choices. The alternatives concern the
dimension, the composition and the selection of the blocks, the way to
enforce convergence conditions and the methods used to solve the subproblems.
Different algorithms can be designed exploiting these degrees of freedom. In
this section we discuss some possible alternatives and we suggest some
practical implementations of Algorithm 1.

The dimension of the blocks is a key factor for the training performance.
It influences the way in which the subproblems can be solved, so it should be
carefully determined according to the nature of the dataset. We can mainly
consider two opposite strategies: blocks of minimal dimension (i.e. SMO-type
methods) and higher dimensional blocks. In SMO-type methods we can take
advantage of the fact that, for each block, the subproblem can be solved
analytically. On the other hand, each SMO-block may yield a small decrease of
the objective function and a slow identification of the support vectors, so
that, to get fast convergence, the simultaneous optimization of a great number
of SMO-blocks may be needed when dealing with high dimensional instances. This
choice could be well suited for an architecture composed of a great number of
simple processing units, like recent Graphics Processing Units (GPUs). Higher
dimensional blocks require the solution of the subproblems by means of some
optimization procedure, hence the solution of each block may need more time.
On the other hand, the decrease of the objective function and the
identification of the support vectors may be faster, so that fewer iterations
should be needed. Hence this choice may be suitable whenever powerful but,
generally, not many processing units are available. Algorithm 1 also
encompasses the possibility of assigning huge blocks to each processor and
using a decomposition method to solve the corresponding huge subproblems. This
strategy essentially consists in iteratively splitting the original SVM
instance into smaller SVMs, distributing them to the available parallel
processes and then gathering their solution points in order to properly define
the new iterate.

An efficient rule to partition the variables into blocks is crucial for the rate
of convergence of the algorithm. Many heuristic methods (see e.g. [ia[ia[29])
have been studied to obtain efficient rules for constructing subproblems that
can guarantee a fast decrease of the objective function (e.g. by determining the
most violating pair). Such methods usually make use of first order information,
implying that the gradient should be partially or entirely updated at each
iteration, and this could be overwhelming in a huge dimensional framework.
However, practical implementations with a (partially) random composition of
the blocks could be considered.

Once we have determined a partition of the whole set of variables into blocks,
only a subset of the resulting subproblems may, in general, be involved in the
optimization process. Indeed, we may further restrict the blocks used to update
the current iterate by determining a subset $\mathcal{J}^k$. We need to keep in mind that
the main computational burden is due to the gradient update. Indeed, at each
iteration the gradient update requires computing the columns of
$Q$ related only to those variables that belong to the selected blocks
$\mathcal{J}^k$. Hence the choice of $\mathcal{J}^k$ may take into account both the decrease of the
objective function and the computational effort for updating the gradient. A
minimal threshold on the percentage of objective function decrease associated
with each block, with respect to the cumulative decrease of all blocks, could
be a possible criterion for a block selection rule in order to avoid useless
computations.
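The threshold rule mentioned above can be illustrated with a small Python sketch. The function name, the dictionary of predicted decreases and the 5% default threshold are hypothetical choices made for illustration; the paper does not prescribe them.

```python
def select_blocks(decreases, min_share=0.05):
    """Keep the blocks whose decrease is at least min_share of the total.

    decreases maps a block label to the (nonnegative) objective decrease
    obtained by solving that block's subproblem.
    """
    total = sum(decreases.values())
    if total <= 0:
        return []
    return [b for b, dec in decreases.items() if dec / total >= min_share]

# Hypothetical decreases for four blocks; two of them contribute very little.
predicted = {"block0": 4.0, "block1": 0.1, "block2": 5.5, "block3": 0.4}
print(select_blocks(predicted))
```

Only the kernel columns of the variables in the retained blocks would then be needed for the gradient update.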

The practical effectiveness of the algorithm is highly related to the way the
convergence conditions stated so far are enforced. If, for example, we do not
take into account the rule of taking at least one descent block every $L$
iterations, we could take the same partition of variables at each iteration.
This short-sighted strategy would be totally ineffective since, in this case,
the algorithm would lead only to an equilibrium of the generalized Nash
equilibrium problem (see e.g. [6H8]) in which the players solve the fixed
subproblems, but not to a solution of the original SVM, as the following
example shows.

Example 1 Let us consider a dual problem with four variables in which
$Q = I$, $y = (1\ 1\ -1\ -1)^T$ and $C = 1$. The unique solution of this problem
is $x^* = (1\ 1\ 1\ 1)^T$. Let $x^k = (0\ 0\ 0\ 0)^T$ and consider a partition
into two blocks: $\mathcal{P} = \{(1,2), (3,4)\}$. Then the best responses are $\hat{x}^k_{(1,2)} = (0\ 0)^T$
and $\hat{x}^k_{(3,4)} = (0\ 0)^T$, which implies that $x^{k+1} = x^k$. Therefore, if we do not
modify $\mathcal{P}$ we will never move from the origin, and hence we will never converge to
$x^*$. However, since the origin is a fixed point for the best responses of the two
processes, it is, by definition, a Nash equilibrium for the game involving these
two players.
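Example 1 can be checked numerically. The sketch below is an assumption-laden illustration rather than the paper's code: it uses 0-based indices, a hypothetical helper `smo_best_response`, and it assumes that the curvature along the feasible line of a two-variable block is strictly positive (true here since $Q = I$). All feasible moves of a pair $(i, j)$ lie on the line $x + t\,d$ with $d_i = y_i$, $d_j = -y_j$, so the best response is a clipped one-dimensional minimizer.

```python
import numpy as np

def smo_best_response(Q, y, x, C, i, j):
    """Exact minimizer of f over (x_i, x_j) with the other variables fixed."""
    grad = Q @ x - 1.0                 # gradient of 0.5*x'Qx - e'x
    d = np.zeros_like(x)
    d[i], d[j] = y[i], -y[j]           # keeps y'x unchanged
    # Interval of t such that 0 <= x + t*d <= C on the two moving components.
    lo, hi = -np.inf, np.inf
    for r in (i, j):
        a, b = (0.0 - x[r]) / d[r], (C - x[r]) / d[r]
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
    curv = d @ Q @ d                   # assumed strictly positive
    t = np.clip(-(grad @ d) / curv, lo, hi)
    return x + t * d

Q = np.eye(4)
y = np.array([1.0, 1.0, -1.0, -1.0])
x = np.zeros(4)

# Blocks of Example 1 (same labels inside each block): nobody moves.
print(smo_best_response(Q, y, x, 1.0, 0, 1))
print(smo_best_response(Q, y, x, 1.0, 2, 3))
# A block mixing the two classes does move away from the origin.
print(smo_best_response(Q, y, x, 1.0, 0, 2))
```

With the partition of the example both best responses coincide with the origin, while the mixed pair reaches $(1\ 0\ 1\ 0)^T$, which shows why the partition must be allowed to change.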

Regarding the choice of the steplength we can use two different strategies:
a linesearch or a diminishing stepsize rule. In the first case, as shown before,
we can use the exact minimization formula (I24L In the second case a simple 
rule could be = ^, with ^ S (0,1], but different choices are also possible. 
Although preliminary tests showed that the exact minimization is more effective
than any other choice, the diminishing stepsize strategy, besides being easy to
implement, requires much less computation, and this could be of great practical
interest for high dimensional instances (training sets with many samples and
many dense features).
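The two stepsize strategies can be contrasted with a short Python sketch. It is only a sketch under stated assumptions: the cap `alpha_max = 1.0` (used to keep the new point feasible) and the rule `alpha0 / (k + 1)` are illustrative choices, not the formulas of the paper.

```python
import numpy as np

def exact_stepsize(Q, grad, d, alpha_max=1.0):
    """Minimizer of f(x + alpha*d) over [0, alpha_max] for a quadratic f."""
    slope = grad @ d           # directional derivative at alpha = 0
    curv = d @ Q @ d           # needs the product Q d
    if slope >= 0:
        return 0.0             # d is not a descent direction
    if curv <= 0:
        return alpha_max       # f decreases linearly along d
    return min(alpha_max, -slope / curv)

def diminishing_stepsize(k, alpha0=1.0):
    """Illustrative diminishing rule alpha0 / (k + 1); an assumption."""
    return alpha0 / (k + 1)

# Two variables with labels y = (1, -1): d = (1, 1) satisfies y'd = 0.
Q = np.array([[2.0, 1.0], [1.0, 2.0]])
grad = np.array([-1.0, -1.0])  # gradient of 0.5*x'Qx - e'x at x = 0
d = np.array([1.0, 1.0])
print(exact_stepsize(Q, grad, d))
print([diminishing_stepsize(k) for k in range(4)])
```

The exact rule needs the curvature term along $d$, whereas the diminishing rule needs no information on the problem, which is where its lower cost comes from.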

As mentioned above, when the dimension of the blocks is more than two, an
optimization algorithm is needed to obtain a solution of the subproblems. Due to
their particular structure (convex quadratic objective function over a
polyhedron), many exact or approximate methods can be applied for their
solution. We simply point out that, whenever the dimension of the blocks is so big
that each block can be considered a sort of smaller SVM, a slight modification of
any efficient software for the sequential training of SVMs (like LIBSVM) could
be used to perform a single optimization step.

By way of example, we propose an SMO-type parallel scheme derived
from Algorithm 1, for which we developed two MATLAB prototypes. This
realization, which we call PARSMO, is based on a partition $\mathcal{P}$ of minimal
dimension blocks, thus performing multiple SMO steps simultaneously in order
to build the search direction $d^k$. Each SMO step is assigned to a parallel process
that analytically solves a two-dimensional subproblem. Thus the computational
effort of each processor is very light and communications must be very fast,
which makes the scheme suitable for a multicore environment.

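Algorithm 2 PARSMO

Initialization Set $x^0 = 0$, $\nabla f^0 = -e$, $\epsilon > 0$, $\eta > 0$ and $k = 0$.
    Select the most violating pair $(i_1, j_1) \in I_{up}^{\epsilon}(x^0) \times I_{low}^{\epsilon}(x^0)$.

Do while ($m^{\epsilon}(x^k) > M^{\epsilon}(x^k) + \eta$)

    S.1 (Blocks definition)
        Choose $(q - 1)$ pairs $\{(i_2, j_2), (i_3, j_3), \dots, (i_q, j_q)\}$.
        Set $\mathcal{J}^k = \{(i_1, j_1), (i_2, j_2), \dots, (i_q, j_q)\}$.

    S.2 (Parallel computation)
        For each pair $(i_h, j_h) \in \mathcal{J}^k$ compute in parallel:
        1. the kernel columns $Q_{i_h}$ and $Q_{j_h}$ (if not available in the cache),
        2. the SMO step $t_h d^h$, with the direction $d^h$ and the stepsize $t_h$
           of the two-dimensional subproblem of the pair.

    S.3 (Direction) $d^k = \sum_{(i_h, j_h) \in \mathcal{J}^k} t_h d^h$.

    S.4 (Stepsize) Compute the steplength $\alpha^k$ by exact minimization along $d^k$.

    S.5 (Update)
        Set $x^{k+1} = x^k + \alpha^k d^k$.
        Set $\nabla f^{k+1} = \nabla f^k + \alpha^k \sum_{r : d^k_r \neq 0} Q_r d^k_r$.
        Set $k = k + 1$.
        Select the most violating pair $(i_1, j_1) \in I_{up}^{\epsilon}(x^k) \times I_{low}^{\epsilon}(x^k)$.

End While

Return $x^k$.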

Starting from the feasible null vector $x^0 = 0$, which is a well known suitable
choice for SVM training algorithms because it allows the gradient $\nabla f^0$ to be
initialized to $-e$, the algorithm selects at each iteration the most violating
pair $(i_1, j_1) \in I_{up}^{\epsilon}(x^k) \times I_{low}^{\epsilon}(x^k)$ and a further $(q - 1)$ pairs that all
together make up $\mathcal{J}^k$.

The search direction at S.3 is obtained by analytically computing the stepsize $t_h$ for
the $q$ two-dimensional subproblems related to the pairs in $\mathcal{J}^k$. The direction $d^k$ is
simply computed by summing up all the SMO steps; it has $\|d^k\|_0 = 2q$ with $q \ge 1$, which
depends on the number of parallel processes that we want to activate. Finally, at
Step S.4 the steplength $\alpha^k$ that exactly minimizes the objective function along $d^k$ is
obtained by the exact minimization formula; note that this step requires no further
kernel evaluations. The same holds for the gradient update. Of course, as is standard
in SVM algorithms, a caching strategy can be exploited to limit the computational
burden due to the evaluation of kernel columns.
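The scheme described above can be simulated sequentially with the following Python sketch. It is not the authors' MATLAB prototype: the function name `parsmo_sketch`, the dense matrix `Q` used instead of a kernel cache, the rule that pairs the h-th largest score in $I_{up}^{\epsilon}$ with the h-th smallest in $I_{low}^{\epsilon}$, the skipping of overlapping pairs, and the cap of the steplength at 1 (which keeps the iterate feasible because the pairs are disjoint) are all assumptions made for illustration.

```python
import numpy as np

def parsmo_sketch(Q, y, C, q=2, eps=1e-10, eta=1e-3, max_iter=1000):
    """Sequential simulation of a PARSMO-like loop; returns (x, iterations)."""
    n = len(y)
    x = np.zeros(n)
    grad = -np.ones(n)                        # gradient of f at x = 0 is -e
    for k in range(max_iter):
        score = -y * grad                     # equals -grad_r / y_r
        up = np.flatnonzero(((x < C - eps) & (y == 1)) | ((x > eps) & (y == -1)))
        low = np.flatnonzero(((x < C - eps) & (y == -1)) | ((x > eps) & (y == 1)))
        if up.size == 0 or low.size == 0:
            return x, k
        I = up[np.argsort(-score[up])][:q]    # largest scores in I_up^eps
        J = low[np.argsort(score[low])][:q]   # smallest scores in I_low^eps
        if score[I[0]] <= score[J[0]] + eta:  # stopping test (36)
            return x, k
        d = np.zeros(n)
        used = set()
        for i, j in zip(I, J):                # one pair per (simulated) process
            if score[i] <= score[j] or i in used or j in used:
                continue                      # keep violating, disjoint pairs
            used.update((i, j))
            # Largest feasible step along d_i = y_i, d_j = -y_j.
            beta = min(C - x[i] if y[i] == 1 else x[i],
                       x[j] if y[j] == 1 else C - x[j])
            curv = Q[i, i] + Q[j, j] - 2.0 * y[i] * y[j] * Q[i, j]
            t = beta if curv <= 0 else min(beta, (score[i] - score[j]) / curv)
            d[i], d[j] = t * y[i], -t * y[j]  # analytic SMO step
        moved = np.flatnonzero(d)
        Qd = Q[:, moved] @ d[moved]           # only columns of moved variables
        curv_d = d @ Qd
        alpha = 1.0 if curv_d <= 0 else min(1.0, -(grad @ d) / curv_d)
        x = x + alpha * d
        grad = grad + alpha * Qd              # no further kernel evaluations
    return x, max_iter

# Problem of Example 1: with pairs mixing the two classes the origin is left.
Q = np.eye(4)
y = np.array([1.0, 1.0, -1.0, -1.0])
x, iters = parsmo_sketch(Q, y, C=1.0, q=2)
print(x, iters)
```

Note that the product `Qd` is computed once and reused for both the steplength and the gradient update, mirroring the remark that these two steps need no additional kernel columns.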

In PARSMO it remains to specify how to select the pairs forming the blocks $\mathcal{J}^k$.
Since the most expensive computational burden is due to the calculation of
kernel columns, we want to analyse the impact of a massive use of the cache with
respect to a standard one. We therefore propose two different MATLAB
implementations of PARSMO that use a cache strategy in two different ways, and we
compare their performance with a standard sequential MVP implementation in
order to analyze possible advantages of the PARSMO scheme.

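In the first implementation, the $q - 1$ pairs, in addition to a MVP, are
selected by choosing those pairs that most violate the first order optimality
condition, as in the SVM$^{light}$ algorithm [T3]. Hence we select $q$ pairs $(i_h, j_h) \in
I_{up}^{\epsilon}(x^k) \times I_{low}^{\epsilon}(x^k)$ sequentially so that

$$-y_{i_1} \nabla f(x^k)_{i_1} \ge -y_{i_2} \nabla f(x^k)_{i_2} \ge \dots \ge -y_{i_q} \nabla f(x^k)_{i_q},$$

and

$$-y_{j_1} \nabla f(x^k)_{j_1} \le -y_{j_2} \nabla f(x^k)_{j_2} \le \dots \le -y_{j_q} \nabla f(x^k)_{j_q}.$$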

In this case, although we can use a standard caching strategy, we cannot
control the number of kernel column evaluations at each iteration, which in the
worst case can be up to $2q$. The computation of the kernel columns $Q_{i_h}$ and
$Q_{j_h}$, $h \in \{1,\dots,q\}$, can be performed in parallel by the processors in
charge of solving the subproblems. In this case the number of kernel evaluations
per iteration would of course be greater than that of a standard MVP, but
the overall number of iterations may decrease. Thus we keep the advantages
of performing simple analytic optimization, as in SMO methods, whilst moving
$2q$ components at a time, as in SVM$^{light}$. We note that the reconstruction of the
overall gradient can be parallelized among the $q$ processors and requires
a synchronization step to take into account the stepsize $\alpha^k$. Thus the CPU-time
needed is essentially equivalent to a gradient update of a single SMO step. In
this approach the transmission time among processors may be quite significant
and this strictly depends on the parallel architecture. We refer to this
implementation as PARSMO-1.

The second implementation selects the $q - 1$ pairs, in addition to a MVP
$(i_1, j_1) \in I_{up}^{\epsilon}(x^k) \times I_{low}^{\epsilon}(x^k)$, exclusively among the indices of the columns
currently available in the cache. To be more precise, let $\mathcal{C}$ be the index set of the
kernel columns available in the cache. The $q - 1$ pairs $(i_h, j_h)$ in $\mathcal{J}^k$ are selected
following the same SVM$^{light}$ rule described above for PARSMO-1, but restricted
to the index sets $\mathcal{C} \cap I_{up}^{\epsilon}(x^k) \times \mathcal{C} \cap I_{low}^{\epsilon}(x^k)$. We refer to it as PARSMO-2. In
this case the number of kernel evaluations per iteration is at most two, as in
a standard MVP implementation. The rationale of this version is to improve
the performance of a classical MVP algorithm by using simultaneous multiple
SMO optimizations without increasing the number of kernel evaluations.
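The two selection rules can be sketched with a single Python function. This is a hedged illustration: the name `select_pairs`, the `cached` argument (a set of column indices standing in for the cache index set) and the way leftover indices are paired in sorted order are assumptions; with `cached=None` the function mimics the PARSMO-1 rule, otherwise the extra pairs are restricted to cached columns as in PARSMO-2.

```python
import numpy as np

def select_pairs(x, grad, y, C, q, eps=1e-10, cached=None):
    """Return up to q disjoint pairs (i, j), the first being the MVP."""
    score = -y * grad
    up = np.flatnonzero(((x < C - eps) & (y == 1)) | ((x > eps) & (y == -1)))
    low = np.flatnonzero(((x < C - eps) & (y == -1)) | ((x > eps) & (y == 1)))
    if up.size == 0 or low.size == 0:
        return []
    up = up[np.argsort(-score[up], kind="stable")]    # decreasing scores
    low = low[np.argsort(score[low], kind="stable")]  # increasing scores
    i1, j1 = int(up[0]), int(low[0])                  # most violating pair
    pairs, used = [(i1, j1)], {i1, j1}

    def allowed(r):
        # The MVP indices are excluded; the cache filter applies if given.
        return r not in used and (cached is None or r in cached)

    up_rest = [int(r) for r in up if allowed(int(r))]
    low_rest = [int(r) for r in low if allowed(int(r))]
    for i, j in zip(up_rest, low_rest):
        if len(pairs) == q or score[i] <= score[j]:
            break                                     # enough pairs, or no violation
        if i in used or j in used:
            continue                                  # keep the pairs disjoint
        pairs.append((i, j))
        used.update((i, j))
    return pairs

y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
grad = np.array([-3.0, -2.0, -1.0, -1.0, -2.0, -3.0])
x = np.zeros(6)
print(select_pairs(x, grad, y, C=1.0, q=3))                    # PARSMO-1 style
print(select_pairs(x, grad, y, C=1.0, q=3, cached={2, 3, 4}))  # PARSMO-2 style
```

In the second call only one extra pair can be formed, because a single index of $I_{up}^{\epsilon}$ is in the hypothetical cache; no new kernel column beyond those of the MVP would be required.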

In order to get a flavour of the potential of these two parallel strategies,
we performed some simple MATLAB experiments for the two versions PARSMO-1
and PARSMO-2. All experiments have been carried out on a 64-bit Intel Core i7
CPU 870 2.93 GHz × 8. Both PARSMO-1 and PARSMO-2 make use of a standard
caching strategy, see [3], with a cache memory of 500 columns. We perform
experiments with $q = 1, 2, 4$ and $8$ parallel processes. Clearly the case $q = 1$
corresponds to a classical MVP algorithm with a standard caching strategy. It
is worth noting that, to preserve the good numerical behavior of PARSMO-1
and PARSMO-2, the use of the “gathering” steplength $\alpha^k$ is necessary. In
fact, further tests, not reported here, showed that by removing the use of $\alpha^k$
oscillatory and divergence phenomena may occur when using multiple parallel
processes. This reinforces the practical relevance of our theoretical analysis.

The major aim of the experiments in this preliminary context is to highlight
the benefits of simultaneously moving along multiple SMO directions. We tested
both PARSMO-1 and PARSMO-2 on six benchmark problems available at the
LIBSVM site http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/,
using a standard setting for the parameters ($C = 1$, Gaussian parameter $\gamma =
1/\#\text{features}$), see Table 1.


name            #features   #training data   kernel type
a9a             123         32561            gaussian
gisette_scale   5000        6000             linear
cod-rna         8           59535            gaussian
real-sim        20958       72309            linear
rcv1            47236       20242            linear
w8a             300         49749            linear

Table 1: Training problems description.


To evaluate the behavior of the algorithms we consider the “relative error”
(RE), defined as

$$RE = \frac{|f^* - f|}{|f^*|},$$

where $f^*$ is the known optimal value of the objective function. As regards
PARSMO-1, for each problem we plot the RE versus

i) the number of iterations (see Figure 1);

ii) the number of kernel evaluations per process, which is obtained by dividing
the total number of kernel evaluations by the number of parallel processes
involved (see Figure 2).
Our results show that the larger $q$ is, the steeper the RE decrease is. This
emphasizes the positive effect of moving along multiple SMO directions at a
time.

As regards PARSMO-2, we note that, except for the MVP pair, which can
require the computation of the kernel columns $Q_{i_1}$ and $Q_{j_1}$, each SMO process
computes only the analytical solution of the two-dimensional subproblem, since
the kernel columns are already available in the cache. Thus, PARSMO-2 may
produce a CPU time saving even when the algorithm is run in a sequential fashion.
In order to show the cheapness of its tasks, in Figure 3 we plot RE versus the
CPU-time consumed.





[Figure 3 panels: real-sim, rcv1, w8a]

Figure 3: PARSMO-2: Relative Error versus (sequential) CPU-time.

PARSMO-2 with $q > 1$ seems to be faster than a classical MVP algorithm.
This is due to the use of multiple search directions without suffering from an
increase in time-consuming kernel evaluations or from the need for iterative
solutions of larger quadratic subproblems. It is important to point out that
PARSMO-2 achieves its good performance by combining a convergent parallel
structure with an efficient sequential implementation, and it seems to be
useful also in a single-core environment.